Viewpoint
Abstract
Synthetic data generation (SDG) for structured health data is increasingly promoted as a solution to longstanding barriers in health data access, offering the promise of privacy-preserving data reuse for research, innovation, and policy. Despite rapid technical advances, the adoption of synthetic health data in real-world settings remains limited, shaped by challenges around data quality, representativeness, infrastructure readiness, trust, and legal uncertainty. This viewpoint draws on experiences from 7 European research initiatives within the HealthData4EU cluster to reflect on how SDG is being operationalized in practice. It synthesizes cross-project insights to highlight recurring methodological and governance tensions and to examine their implications for trust and responsible use. The analysis argues that trustworthy SDG cannot be achieved through technical optimization alone but requires alignment between evaluation practices, upstream data stewardship, regulatory clarity, and sustained stakeholder engagement. Addressing these conditions is essential for moving synthetic data from experimental pilots toward a credible and sustainable component of European health research ecosystems.
J Med Internet Res 2026;28:e83369. doi: 10.2196/83369
Introduction
As health care systems become increasingly digitized and data-driven, access to large-scale, high-quality data is essential for clinical research [-], digital health innovation [], and evidence-based policymaking []. However, accessing patient-level health data remains a persistent challenge []. Privacy regulations [], data quality issues [,], and data fragmentation [] continue to limit data access and sharing, especially in cross-organizational studies [,]. These barriers not only slow innovation but also risk underrepresenting certain diseases and populations, such as rare conditions, in secondary data use. As a result, many promising analytical and clinical applications struggle to move beyond pilots.
Synthetic data generation (SDG) has emerged as a promising response to the barriers limiting access and reuse of patient-level health data []. By producing synthetic datasets that resemble the statistical and structural properties of real health data, without exposing personal identities, SDG can drive innovation while maintaining privacy []. It enables safer collaboration across different types of health data (eg, electronic health records, genomics, medical imaging) and offers value in areas such as artificial intelligence (AI) development [,], rare disease research [], and model testing [,]. The main aim of SDG is to preserve the essential characteristics of the original datasets while protecting privacy (in most use cases). These characteristics can be grouped into quantitative [] (eg, statistical metrics, analytical patterns, signal-based features), qualitative [] (eg, expert-driven assessments), and domain-specific attributes [] that depend on the data type and specific use case. Conceptual work has also emphasized the importance of incorporating domain knowledge into SDG processes to improve realism and relevance for clinical applications []. Despite this growing interest, the adoption of synthetic health data in real-world research and clinical environments remains limited [,]. Concerns persist regarding data quality, representativeness, interpretability, and legal status [,], as well as uncertainty about whether synthetic data exceed what current methods and data infrastructures can deliver []. This gap between promise and practice has contributed to skepticism among multiple stakeholders, for whom trust in data sources is essential []. This lack of trust is partly explained by the focus of the SDG literature on technical methods and evaluation metrics [,,] while giving limited attention to the organizational, legal, and sociotechnical conditions that shape real-world adoption [,].
Several EU-funded initiatives are currently testing SDG in real-world health research settings, all facing a shared question: how can synthetic health data be made trustworthy, useful, and safe? Despite this activity, there has been limited cross-initiative reflection on how these challenges are being addressed in practice.
The aim of this viewpoint is to examine why trust in synthetic health data remains fragile despite growing technical maturity and to reflect on the nontechnical conditions required for its responsible adoption. This article is intended for health data researchers, clinicians, data stewards, infrastructure developers, regulators, and policymakers engaged in secondary use of health data across Europe. This article provides a viewpoint informed by experiences from 7 European research initiatives within the HealthData4EU cluster that work with structured synthetic health data. It synthesizes cross-project insights to highlight recurring challenges related to data quality, interoperability, regulatory alignment, and stakeholder acceptance and reflects on their implications for trust, governance, and responsible use of synthetic data in European health research.
Basis of This Viewpoint
Scope and Context of the Included Projects
This viewpoint is informed by 7 EU-funded projects that constitute the HealthData4EU cluster, a Horizon Europe initiative focused on advancing SDG and secure data use in health research across Europe. Collectively, these projects address complementary aspects of the synthetic data landscape, including data generation methods, federated and privacy-preserving architectures, secure data sharing, and AI-enabled clinical applications. All operate within a shared European policy and funding framework while engaging with diverse clinical and institutional contexts relevant to the European Health Data Space.
The included initiatives represent the full set of synthetic data projects participating in the HealthData4EU cluster at the time of writing. While they operate within this shared European policy and funding framework, they span a wide range of clinical domains, data modalities (eg, tabular, imaging, longitudinal, multimodal data), and technical approaches. This diversity within a coordinated structure provides a useful basis for reflecting on how SDG is being operationalized across health care settings and where common challenges and tensions emerge. An overview of the participating initiatives with their clinical domains is provided in the table below.
| Project | Clinical domain(s) | Example use cases |
| AISym4MED | Neurological and chronic diseases (ie, diffuse large B-cell lymphoma, lung cancer, multiple myeloma, Alzheimer disease, type 2 diabetes, breast cancer) | |
| FLUTE | Oncology (prostate cancer) | |
| PHASE IV AI | Oncology and neurology (ie, lung cancer, prostate cancer, ischemic stroke) | |
| PHEMS | Pediatric diseases (ie, congenital cardiac conditions, sepsis, hemophilia) | |
| SECURED | Multiple domains (ie, mammography, histopathology, chest radiography, cardiotocography) | |
| SYNTHEMA | Rare hematological diseases (ie, sickle-cell disease, acute myeloid leukemia) | |
| SYNTHIA | Multimodal and personalized medicine | |
aAI: artificial intelligence.
bSDG: synthetic data generation.
The insights synthesized in this paper draw on cross-project engagement conducted during the active lifetime of the initiatives, informed by cluster-level workshops, project presentations, recurring exchanges among project teams, and review of project documentation. Importantly, these interactions took place while projects were still ongoing, allowing challenges and emerging practices to be discussed as they arose rather than retrospectively. The exchanges involved multidisciplinary contributors, including technical developers, clinicians, data stewards, infrastructure providers, and legal and ethical experts.
While these insights are not derived from a formal empirical study, they reflect sustained, longitudinal engagement with projects that are actively developing and validating the real-world use of synthetic health data within regulated clinical and research environments. To improve transparency and address requests for additional project-level context, we have included a table in the multimedia appendix providing a descriptive overview of the included projects.
Lessons From 7 SDG Projects
A Shared Landscape of Methodological Tensions
Across projects, SDG is not pursued as a single technical task but as a complex sociotechnical intervention embedded in heterogeneous health systems []. In practice, SDG sits at the intersection of machine learning, clinical data infrastructures, regulatory governance, and institutional trust.
Despite differences in disease focus, data modalities, and maturity, projects consistently encounter several recurring challenges that shape how SDG is operationalized in practice. Across the initiatives examined in this paper, 3 themes emerged repeatedly: the absence of shared consensus on how synthetic data quality should be defined and evaluated, the limitations of existing data infrastructures to support reliable SDG workflows, and uncertainty surrounding the regulatory and governance status of synthetic data. These challenges reflect the structural characteristics of health data ecosystems, where technical innovation must coexist with legal safeguards, clinical accountability, and fragmented infrastructures. Understanding SDG through these challenges helps explain why synthetic data adoption remains difficult to scale beyond pilot environments.
Data Quality Without Consensus
A central tension concerns how “data quality” in synthetic health data is defined and demonstrated. In practice, data quality is assessed through overlapping but fragmented lenses: statistical fidelity, analytical or clinical utility, and privacy protection. These dimensions are widely acknowledged across initiatives, yet there is no shared understanding of how they should be balanced, prioritized, or interpreted in different contexts [].
Existing tools and metrics often assume that real-world source data are themselves unbiased, complete, and representative [,,,]. This assumption is rarely valid in health systems shaped by structural inequities, fragmented data capture, and variable clinical practices across institutions and regions. When real data reflect structural inequities or incomplete capture, synthetic data derived from them may inherit the same distortions [].
As a result, SDG risks reinforcing existing biases rather than mitigating them. While synthetic data are frequently promoted as a mechanism for improving representativeness or enabling rare disease research, representational properties are typically assessed only at the level of the resulting dataset []. In practice, however, representativeness is shaped not only by the quality of the synthetic generation process but also by the characteristics of the underlying source data and the transformations applied during the data lifecycle (eg, data extraction, harmonization, mapping to common data models, and other extract-transform-load steps) []. These transformations can introduce or amplify biases by selectively filtering variables, standardizing heterogeneous data sources, or excluding smaller subpopulations. Such upstream influences may remain difficult to detect when evaluation focuses primarily on statistical similarity. Consequently, population coverage and the potential impact of data preparation processes on representation may receive less attention in current evaluation practices.
Data quality, then, cannot be reduced to statistical resemblance alone. We argue that more explicit data quality assessment is needed at the source and across the data lifecycle, and that results must be interpreted in relation to the populations, decisions, and governance contexts in which synthetic data are intended to be used.
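To make the limits of statistical resemblance concrete, the sketch below checks per-variable marginal fidelity between a real and a synthetic table using a two-sample Kolmogorov-Smirnov test, one of the simplest fidelity metrics in common use. It is a deliberately minimal illustration on simulated stand-in data: the `marginal_fidelity` helper is our own illustrative construction, not the evaluation pipeline of any cluster project, and a high score here says nothing about clinical utility, privacy, or population coverage.

```python
import numpy as np
from scipy import stats


def marginal_fidelity(real, synthetic, alpha=0.05):
    """Fraction of columns whose marginal distributions show no
    significant divergence under a two-sample Kolmogorov-Smirnov test."""
    keep = []
    for col in range(real.shape[1]):
        _, p = stats.ks_2samp(real[:, col], synthetic[:, col])
        keep.append(p >= alpha)  # True = no detectable divergence
    return sum(keep) / len(keep)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))              # stand-in "real" data
shifted = rng.normal(loc=2.0, size=(500, 3))  # a poor synthetic stand-in

print(marginal_fidelity(real, real))     # identical data: 1.0
print(marginal_fidelity(real, shifted))  # strongly shifted: 0.0
```

Note that this check compares only one-dimensional marginals: a synthetic dataset could pass it perfectly while missing subpopulations or correlations entirely, which is precisely the gap in evaluation practice discussed above.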
Infrastructure as a Bottleneck to Trust
Federated and decentralized SDG architectures are frequently presented as solutions to privacy and governance constraints []. In practice, infrastructure maturity at data sources emerges as a critical limiting factor. Heterogeneous hospital IT systems, inconsistent data models, and uneven data governance practices complicate federated training and validation. These challenges affect not only technical integration but also the reliability of data inputs and the feasibility of data quality assessments.
Interoperability challenges persist even when projects adopt common data models such as the Observational Medical Outcomes Partnership Common Data Model. Differences in local coding practices, laboratory workflows, and semantic interpretation (eg, variation in LOINC [Logical Observation Identifiers Names and Codes] mappings for similar laboratory measurements) require substantial upfront coordination. These nuances complicate federated validation and cross-site comparability, reinforcing that standardization alone does not eliminate heterogeneity in real-world health data. Within the HealthData4EU cluster, several projects reported that aligning local coding practices and laboratory mappings across institutions required substantial preprocessing effort before federated SDG or validation could be performed.
These infrastructural constraints shape what forms of SDG are feasible and who can participate, reinforcing a gap between technical potential and institutional readiness.
Trust, Transparency, and Regulatory Uncertainty
Trust Is Not Guaranteed by Compliance
Most initiatives emphasize compliance with data protection and ethical frameworks. While compliance is necessary, it is not sufficient to generate trust. Trust emerges through transparency, interpretability, and meaningful stakeholder engagement across the SDG lifecycle [].
In practice, these dimensions are unevenly operationalized. Some initiatives invest in dataset documentation, model cards, and participatory validation involving clinicians or data stewards. Others limit engagement to formal approval processes or internal review. This variability affects adoption, particularly in clinical and regulatory contexts where accountability and explainability are essential.
In the European context, this includes emerging obligations under the EU Artificial Intelligence Act, specifically for high-risk medical AI systems, where the provenance and validation of training data, including synthetic data, may become subject to increased scrutiny.
The Unresolved Legal Status of Synthetic Data
Regulatory ambiguity remains one of the most significant barriers to uptake. Synthetic data are often assumed to fall outside data protection regulation, yet institutions frequently adopt conservative interpretations, particularly for rare diseases or small cohorts []. The absence of harmonized guidance leaves data controllers and ethics boards navigating uncertainty case by case.
Clarification is particularly needed in relation to General Data Protection Regulation (GDPR) anonymization thresholds, the interaction with the EU Artificial Intelligence Act, and national interpretations by data protection authorities and ethics committees.
This uncertainty raises a fundamental but often unspoken question: what is the value of synthetic data if it cannot be confidently reused beyond its original context? If synthetic datasets remain confined to local pilots due to legal or institutional caution, their potential to support cross-border research, reproducibility, and capacity building is limited. Addressing reusability therefore becomes central to the trust debate, linking legal clarity, documentation, and transparent validation to the broader promise of synthetic data as a shared research asset rather than a one-off technical output.
Emerging Responses and Their Limits
Encouragingly, initiatives are beginning to respond through federated validation pipelines, benchmarking libraries, interoperability standards, and public-facing platforms. However, these responses remain fragmented, and no shared consensus has yet emerged on what constitutes trustworthy SDG across contexts.
Reflections and Recommendations
The experiences discussed in this article suggest that SDG in health research has reached a point where technical feasibility is no longer the primary bottleneck. Instead, progress increasingly depends on how synthetic data are evaluated, governed, communicated, and trusted in practice.
First, there is a need to reframe how success in SDG is defined and assessed. Much of the current emphasis remains on demonstrating technical performance, such as statistical similarity or model accuracy. While these measures are important, they do not on their own indicate whether synthetic data are appropriate for research or operational contexts. In practice, synthetic datasets may serve different purposes (eg, exploratory analysis, educational use, or hypothesis testing). These uses imply different expectations regarding the level of fidelity, reliability, and validation required. Clarifying the intended context of use can therefore help guide how synthetic datasets are evaluated and interpreted. This does not imply that all potential analytical applications must be evaluated in advance, but rather that evaluation strategies should be transparent about the contexts for which synthetic datasets are considered suitable. At the same time, general evaluation metrics remain essential to characterize the overall fidelity and privacy properties of synthetic datasets. In many cases, findings derived from exploratory analysis using synthetic data may subsequently require validation using the original data sources where appropriate access mechanisms exist.
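One widely used way to make "fitness for an analytical context" concrete is a train-on-synthetic, test-on-real (TSTR) comparison: a model trained on synthetic data is evaluated against held-out real data and compared with a model trained on real data. The sketch below illustrates the idea with simulated stand-in data; the `simulate` function and the logistic model are illustrative assumptions, not methods drawn from any of the cluster projects.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)


def simulate(n):
    """Simulated stand-in for a clinical dataset (not real patient data)."""
    X = rng.normal(size=(n, 5))
    y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0])
         + rng.normal(scale=0.5, size=n)) > 0
    return X, y.astype(int)


X_real, y_real = simulate(1000)  # "real" source data
X_syn, y_syn = simulate(1000)    # stands in for a well-generated synthetic set

# Hold out part of the real data for testing both models.
X_train, y_train = X_real[:500], y_real[:500]
X_test, y_test = X_real[500:], y_real[500:]

trtr = LogisticRegression().fit(X_train, y_train)  # train real, test real
tstr = LogisticRegression().fit(X_syn, y_syn)      # train synthetic, test real

auc_trtr = roc_auc_score(y_test, trtr.predict_proba(X_test)[:, 1])
auc_tstr = roc_auc_score(y_test, tstr.predict_proba(X_test)[:, 1])
print(f"TRTR AUC: {auc_trtr:.3f}, TSTR AUC: {auc_tstr:.3f}")
```

A TSTR score close to its train-on-real counterpart supports one specific context of use (here, a classification task); it does not certify the synthetic dataset for other analyses, which is why evaluation strategies should state the contexts for which suitability is claimed.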
Second, greater attention should be given to data stewardship and governance upstream of SDG. Synthetic data reflect the characteristics and limitations of the infrastructures from which they are generated. Without robust practices for data quality management, interoperability, and documentation at source, SDG risks perpetuating existing shortcomings rather than addressing them. Aligning SDG initiatives with broader efforts to strengthen health data infrastructures would therefore increase both the reliability of synthetic outputs and institutional confidence in their reuse.
Third, addressing the legal uncertainty surrounding synthetic health data is essential for wider adoption. In the absence of shared guidance, institutions often adopt cautious positions that limit reuse, particularly in sensitive or cross-border contexts. Developing common interpretations, best practices, or certification pathways that clarify when synthetic data can be considered sufficiently anonymized would provide a more stable foundation for decision-making. Such approaches would not eliminate the need for contextual oversight, but they would reduce fragmentation and support more consistent application of safeguards.
Finally, trust building must be recognized as an active and ongoing process. Trust does not arise automatically from compliance with legal or ethical requirements nor from technical sophistication alone. It depends on transparency, interpretability, and the ability of stakeholders to understand how synthetic data were generated, evaluated, and intended to be used. Embedding mechanisms such as clear documentation, accessible validation summaries, and engagement with clinicians, data stewards, and other end users can support accountability and enable more nuanced judgments about risk and value.
Overall, these reflections point to a broader shift in focus: from demonstrating that synthetic data can be generated to establishing the conditions under which it can be used responsibly, communicated transparently, and governed consistently. Advancing SDG along this path will require coordination across technical, institutional, and regulatory domains. If achieved, synthetic data can become not only a powerful analytical tool but also a credible and trusted component of health research ecosystems.
Acknowledgments
The authors would like to thank all 7 HealthData4EU projects for their valuable input, collaboration, and contributions to this paper. Their openness and insights were essential to shaping the cross-case analysis. The authors are also grateful to Iraia Nuñez and Anna Lorenzini for their dedicated coordination and communications support. While not involved in the technical content of this study, their efforts in organizing exchanges and facilitating collaboration across the projects played a vital role in enabling this synthesis. No generative artificial intelligence tools were used in the generation of this paper.
Funding
This paper was funded by the Innovative Health Initiative through the SYNTHIA project. The study draws on insights from 7 research projects that form part of the HealthData4EU cluster. Projects within the cluster and their funding sources are as follows. SYNTHEMA, PHASE IV AI, SECURED, FLUTE, and AISym4MED are funded by the Horizon Europe program. PHEMS is funded by Horizon Europe and UK Research and Innovation. SYNTHIA is funded through the Innovative Health Initiative.
Conflicts of Interest
SH is employed by VEIL.AI, a company developing data anonymization technologies. The work presented in this manuscript was conducted as part of the PHEMS consortium and was not influenced by the employer, and SH declares no direct financial or commercial interests related to this publication. The other authors have no conflicts of interest to declare.
Overview of all HealthData4EU projects, their objectives, data modalities, use cases, and methodological challenges.
DOCX File, 43 KB

References
- Eden R, Burton-Jones A, Scott I, Staib A, Sullivan C. Effects of eHealth on hospital practice: synthesis of the current literature. Aust Health Rev. Sep 2018;42(5):568-578. [CrossRef] [Medline]
- Zheng K, Abraham J, Novak LL, Reynolds TL, Gettinger A. A survey of the literature on unintended consequences associated with health information technology: 2014–2015. Yearb Med Inform. Mar 06, 2018;25(01):13-29. [CrossRef]
- Declerck J, Lee J, Sen A, Palmeri A, Oostenbrink R, Giannuzzi V, et al. The potential to leverage real-world data for pediatric clinical trials: a proof-of-concept study. J Med Internet Res. May 30, 2025;27:e72573. [FREE Full text] [CrossRef] [Medline]
- Wang Z, Penning M, Zozus M. Analysis of anesthesia screens for rule-based data quality assessment opportunities. Stud Health Technol Inform. 2019;257:473-478. [FREE Full text] [Medline]
- Wiebe N, Xu Y, Shaheen AA, Eastwood C, Boussat B, Quan H. Indicators of missing electronic medical record (EMR) discharge summaries: a retrospective study on Canadian data. Int J Popul Data Sci. Dec 11, 2020;5(1):1352. [FREE Full text] [CrossRef] [Medline]
- Declerck J, Kalra D, Vander Stichele R, Coorevits P. Frameworks, dimensions, definitions of aspects, and assessment methods for the appraisal of quality of health data for secondary use: comprehensive overview of reviews. JMIR Med Inform. Mar 06, 2024;12:e51560. [FREE Full text] [CrossRef] [Medline]
- Shabani M, Marelli L. Re-identifiability of genomic data and the GDPR: assessing the re-identifiability of genomic data in light of the EU General Data Protection Regulation. EMBO Rep. Jun 2019;20(6):e48316. [FREE Full text] [CrossRef] [Medline]
- Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4(1):1244. [FREE Full text] [CrossRef] [Medline]
- Declerck J, Vandenberk B, Deschepper M, Colpaert K, Cool L, Goemaere J, et al. Building a foundation for high-quality health data: multihospital case study in Belgium. JMIR Med Inform. Dec 20, 2024;12:e60244. [FREE Full text] [CrossRef] [Medline]
- Wei W, Leibson CL, Ransom JE, Kho AN, Caraballo PJ, Chai HS, et al. Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus. J Am Med Inform Assoc. 2012;19(2):219-224. [FREE Full text] [CrossRef] [Medline]
- Oja M, Tamm S, Mooses K, Pajusalu M, Talvik H, Ott A, et al. Transforming Estonian health data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model: lessons learned. JAMIA Open. Dec 2023;6(4):ooad100. [FREE Full text] [CrossRef] [Medline]
- Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: a narrative review. PLOS Digit Health. Jan 2023;2(1):e0000082. [FREE Full text] [CrossRef] [Medline]
- Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. NPJ Digit Med. Oct 09, 2023;6(1):186. [FREE Full text] [CrossRef] [Medline]
- Al-Dhamari I, Abu Attieh H, Prasser F. Synthetic datasets for open software development in rare disease research. Orphanet J Rare Dis. Jul 15, 2024;19(1):265. [FREE Full text] [CrossRef] [Medline]
- Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol. May 07, 2020;20(1):108. [FREE Full text] [CrossRef] [Medline]
- Park N, Mohammadi M, Gorde K, Jajodia S, Park H, Kim Y. Data synthesis based on generative adversarial networks. Proc VLDB Endow. Jun 2018;11(10):1071-1083. [CrossRef]
- Esteban C, Hyland S, Rätsch G. Real-valued (medical) time series generation with recurrent conditional GANs. ArXiv. Preprint posted online on December 4, 2017. [CrossRef]
- Hashemi A, Soliman A, Lundström J, Etminani K. Domain knowledge-driven generation of synthetic healthcare data. In: Volume 302: Caring is Sharing – Exploiting the Value in Data for Health and Innovation. Amsterdam, Netherlands. IOS Press eBooks; May 22, 2023:352-353.
- Ogwel B, Mzazi VH, Awuor AO, Otieno G, Ogolla S, Nyawanda BO, et al. A quarter-century of synthetic data in healthcare: unveiling trends with structural topic modeling. Digit Health. 2025;11:20552076251404530. [FREE Full text] [CrossRef] [Medline]
- Alaa A, van Breugel B, Saveliev E, van der Schaar M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. ArXiv. Jul 13, 2022:1-17. [CrossRef]
- Stenger M, Leppich R, Foster I, Kounev S, Bauer A. Evaluation is key: a survey on evaluation measures for synthetic time series. J Big Data. May 07, 2024;11(1):66. [CrossRef]
- Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. Jan 2019;25(1):44-56. [CrossRef] [Medline]
- Jordon J, Szpruch L, Houssiau F, Bottarelli M, Cherubin G, Maple C. Synthetic data -- what, why and how? ArXiv. Preprint posted online on May 6, 2022. [CrossRef]
- Kaabachi B, Despraz J, Meurers T, Otte K, Halilovic M, Kulynych B, et al. A scoping review of privacy and utility metrics in medical synthetic data. NPJ Digit Med. Jan 27, 2025;8(1):60. [FREE Full text] [CrossRef] [Medline]
- Liaw S, Guo JGN, Ansari S, Jonnagaddala J, Godinho MA, Borelli AJ, et al. Quality assessment of real-world data repositories across the data life cycle: a literature review. J Am Med Inform Assoc. Jul 14, 2021;28(7):1591-1599. [FREE Full text] [CrossRef] [Medline]
- Pilgram L, Ko H, Tung A, El Emam K. Protecting patient privacy in tabular synthetic health data: a regulatory perspective. NPJ Digit Med. Nov 28, 2025;8(1):732. [FREE Full text] [CrossRef] [Medline]
Abbreviations
| AI: artificial intelligence |
| GDPR: General Data Protection Regulation |
| LOINC: Logical Observation Identifiers Names and Codes |
| SDG: synthetic data generation |
Edited by A Mavragani; submitted 01.Sep.2025; peer-reviewed by D Liu, M Deschepper, L Busatta, N Jackson, A Hashemi; comments to author 29.Oct.2025; revised version received 04.Mar.2026; accepted 24.Mar.2026; published 29.Apr.2026.
Copyright©Jens Declerck, Dipak Kalra, Antti Airola, Ahmed Youssef Ali Amer, Christos Chatzichristos, Maria del Mar Mañu Pereira, Bruno M. de Brito Robalo, Francesco Ghini, Alberto Gutierrez-Torre, Sem Hoogteijling, Susanne Hultsch, Jan Ramon, Sara Reidel, Francesco Regazzoni, Luís Silva, Inês Silveira, Tsekeridou Sofia, Christophe Maes. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 29.Apr.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

